Running NVFLARE locally and debugging

Because NVFLARE is a distributed framework and still a work in progress (WIP), it is essential to be able to run it in a simulated setting on a single computer and to debug it locally.

While this setup lets you run NVFLARE locally, it is neither suitable nor recommended for running entire experiments.

When to use NVFLARE local runs and debugging

Before submitting the scripts

Restarting VMs, merging code, submitting to all the VMs, etc. can take around half an hour. Having a smaller, easier-to-run use case therefore leads to much faster development.

Typos and missing libraries are frustrating errors that prolong deployment. The simulator is a great tool for testing your script before running it in Azure, as it catches all the classic Python errors.

Rewriting code back to plain PyTorch just to debug it may not be the most time-effective approach. With the simulator, you can instead debug your code in a pseudo-distributed way.

Figuring out environment settings

We aim to provide stable environments in which NVFLARE runs without problems. However, libraries sometimes interact badly (conflicting dependencies), which usually manifests as NVFlare configuration errors. Running the NVFLARE simulator locally can identify such problems and lets you experiment with the environment.

Tips and Tricks

  1. Always use just a subset of the data — local runs are meant as quick sanity checks, not full training runs
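For example, a minimal stdlib sketch of subsetting the training data before handing it to your executor (`make_debug_subset` is a hypothetical helper, not part of NVFLARE):

```python
import random

def make_debug_subset(samples, fraction=0.01, seed=0):
    """Return a small, reproducible subset of `samples` for local simulator runs."""
    rng = random.Random(seed)  # fixed seed => the same subset on every run
    k = max(1, int(len(samples) * fraction))
    return rng.sample(samples, k)

# Pretend `samples` is the full training set loaded in your training script.
samples = list(range(10_000))
debug_samples = make_debug_subset(samples, fraction=0.01)
print(len(debug_samples))  # -> 100
```

Keeping the seed fixed makes failures reproducible across simulator runs, which matters more here than data coverage.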

How to run NVFlare locally

To run NVFLARE on a single computer, the so-called FL Simulator is provided. The simulator should be launchable from any conda or venv environment in which NVFLARE is installed. You can, of course, install the dependencies system-wide without venv or conda, but that complicates any reinstallation or upgrade of packages, to the point where it may be easier to tear down the whole Python installation.
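A typical environment setup might look like this (the environment name is illustrative):

```
python3 -m venv nvflare-env
source nvflare-env/bin/activate
pip install nvflare
```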

If you are working on Windows, you will need WSL. For reference, please see the WSL Setup guide.

Once you have WSL installed, or some Linux distribution running on your machine, you can run the application:

```
usage: nvflare simulator [-h] -w WORKSPACE [-n N_CLIENTS] [-c CLIENTS] [-t THREADS] [-gpu GPU] [-m MAX_CLIENTS] job_folder

positional arguments:
  job_folder
```

Let's take a look at the possible parameters:

- `job_folder` – a classic NVFlare job folder with `custom` and `config` folders, the same one that would be submitted to clients via the admin console

- `workspace` – folder where client configs are generated, models are saved, etc. You can point this at a temporary folder if you don't need to keep the results

- `n_clients` – number of clients

- `clients` – comma-separated list of client names

- `threads` – number of clients running in parallel

- `gpu` – list of GPUs to be used (if multiple); if omitted, clients run on CPU only

- `max_clients` – maximum number of clients (100 by default); use this parameter if you need to raise the limit
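Putting it together, a typical local run might look like this (the workspace path and job folder name are illustrative):

```
nvflare simulator -w /tmp/nvflare_workspace -n 2 -t 2 ./jobs/hello-pt
```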

How to debug NVFlare locally

Once you have WSL installed, or some Linux distribution running on your machine, it is also possible to debug the application. With a WSL setup, however, there is one more caveat: to use the debugger from within an IDE, you need to set up remote debugging. For the best experience, we advise using Visual Studio Code, as PyCharm appears to have problems with remote debugging inside WSL. For more information about setting up a remote debugger in VS Code with WSL, please see this resource.

Now add this script, provided by NVIDIA in the documentation, to your code:

```python
import argparse
import sys

from nvflare.private.fed.app.simulator.simulator_runner import SimulatorRunner


def define_simulator_parser(simulator_parser):
    simulator_parser.add_argument("job_folder")
    simulator_parser.add_argument("-w", "--workspace", type=str, help="WORKSPACE folder")
    simulator_parser.add_argument("-n", "--n_clients", type=int, help="number of clients")
    simulator_parser.add_argument("-c", "--clients", type=str, help="client names list")
    simulator_parser.add_argument("-t", "--threads", type=int, help="number of parallel running clients")
    simulator_parser.add_argument("-gpu", "--gpu", type=str, help="list of GPU Device Ids, comma separated")
    simulator_parser.add_argument("-m", "--max_clients", type=int, default=100, help="max number of clients")


def run_simulator(simulator_args):
    simulator = SimulatorRunner(
        job_folder=simulator_args.job_folder,
        workspace=simulator_args.workspace,
        clients=simulator_args.clients,
        n_clients=simulator_args.n_clients,
        threads=simulator_args.threads,
        gpu=simulator_args.gpu,
        max_clients=simulator_args.max_clients,
    )
    run_status = simulator.run()

    return run_status


if __name__ == "__main__":
    """
    This is the main program when running the NVFlare Simulator. Use the Flare simulator API,
    create the SimulatorRunner object, do a setup(), then call run().
    """

    if sys.version_info < (3, 7):
        raise RuntimeError("Please use Python 3.7 or above.")

    parser = argparse.ArgumentParser()
    define_simulator_parser(parser)
    args = parser.parse_args()
    status = run_simulator(args)
    sys.exit(status)
```

Now you can run this script with the same parameters as `nvflare simulator` above, and breakpoints will be triggered during NVFLARE execution.
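If you prefer launching the debugger directly from VS Code, a minimal `launch.json` entry might look like the following (the script path, workspace path, and job folder name are illustrative; older VS Code versions use `"type": "python"` instead of `"debugpy"`):

```json
{
    "version": "0.2.0",
    "configurations": [
        {
            "name": "NVFLARE Simulator",
            "type": "debugpy",
            "request": "launch",
            "program": "${workspaceFolder}/run_simulator.py",
            "args": ["-w", "/tmp/nvflare_workspace", "-n", "2", "-t", "2", "jobs/hello-pt"],
            "console": "integratedTerminal"
        }
    ]
}
```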

For more information and advanced usecases, please see NVIDIA documentation.

Debugging online deployment

To some extent, it is also possible to debug the federated task itself. This is not breakpoint debugging for finding flaws in the training code, however; it is aimed more at finding errors in the communication and the federation itself.

Modify server container

To debug the server, the easiest approach is to add

```
ENV FL_LOG_LEVEL=DEBUG
```

at the end of the Dockerfile for the server environment.

Modify client

The easiest way to understand what is happening on a client is to modify the file /workspace/FL-mvp/prod_00/client_name/local/log.config (or log.config.default) for that client and change `logger_root` from:

```
[logger_root]
level=INFO
handlers=consoleHandler
```

to

```
[logger_root]
level=DEBUG
handlers=consoleHandler
```